-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add node REPLACE shutdown implementation #76247
Add node REPLACE shutdown implementation #76247
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a first look and have a few questions.
Other than that my main comment is that we will be putting a lot of trust onto the caller, since if they do it in any way wrongly, nodes will run out of disk space or mess things up (for instance put regular shards on frozen nodes). This might be ok but it feels a bit like removing the guardrails. Will give this some more thought...
...a/org/elasticsearch/cluster/routing/allocation/decider/NodeReplacementAllocationDecider.java
Outdated
Show resolved
Hide resolved
...in/java/org/elasticsearch/cluster/routing/allocation/decider/NodeReplaceOverrideWrapper.java
Outdated
Show resolved
Hide resolved
...in/java/org/elasticsearch/cluster/routing/allocation/decider/NodeReplaceOverrideWrapper.java
Outdated
Show resolved
Hide resolved
...er/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/AllocationDecider.java
Outdated
Show resolved
Hide resolved
1d987c8
to
f2792ab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left a few comments on the new approach.
...er/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/AllocationDecider.java
Outdated
Show resolved
Hide resolved
...src/main/java/org/elasticsearch/cluster/routing/allocation/decider/DiskThresholdDecider.java
Outdated
Show resolved
Hide resolved
.../main/java/org/elasticsearch/cluster/routing/allocation/decider/FilterAllocationDecider.java
Outdated
Show resolved
Hide resolved
...er/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/AllocationDecider.java
Outdated
Show resolved
Hide resolved
.../main/java/org/elasticsearch/cluster/routing/allocation/decider/FilterAllocationDecider.java
Outdated
Show resolved
Hide resolved
...a/org/elasticsearch/cluster/routing/allocation/decider/NodeReplacementAllocationDecider.java
Outdated
Show resolved
Hide resolved
} else if (isReplacementTarget(allocation.metadata(), node.getId())) { | ||
return Decision.single(Decision.Type.NO, NAME, | ||
"node [%s] is a node replacement target, shards cannot auto expand to be on it until the replacement is complete", | ||
node.getId(), "source"); | ||
} else if (isReplacementSource(allocation.metadata(), node.getId())) { | ||
return Decision.single(Decision.Type.NO, NAME, | ||
"node [%s] is being replaced, shards cannot auto expand to be on it", node.getId()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess it could be ok, but maybe worth discussing anyway if we should count a replacing source,target pair as one for auto-expand?
The benefit of contracting auto-expand is obviously that we might drop a replica on the source and thus speed up the time. On the other hand, we are jeopardizing data safety.
At least it means REPLACE has some "express-vacate" semantics, i.e., should not be used for softer vacates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was on the fence about this also, I don't have a strong feeling about this, do you think we should go one way or another?
The difficulty with considering a source-target pair as one for auto-expand is that it could have been invalid to expand to the source node for other reasons, but valid on the target node, and that would but it in the "could be dangerous to add another shard to the target node" category. Hence, for safety I went with the "neither" approach for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More than the classical "give me a copy on every node for search performance" I am more worried about our use of 0-1
on system indices like security indices.
In particular in the replace case, we risk that the reduce of replica count causes the remaining copy to reside on the source node, waiting to be vacated. Given that we are running against time here, that adds some risk of loosing the last copy of data. To become safe, we should also ensure that any copies on the target/source node is the one that we get rid of, which would be complicated.
Regarding your case of it being invalid on the source node, I think there are really two cases there:
- The shard already exists on the source node (i.e., the node was excluded somehow but the shard has not yet moved off).
- The shard does not exist on the source node (but on some other nodes).
In neither case will we allow establishing a replica on the target node, NodeReplacementAllocationDecider
will prevent that. Only relocations to target node will pass through canAllocate
.
In the 1st case we would in many cases have to vacate the shard anyway. So the benefit is only in the case where we are lucky that the copy on the source node is the one we remove. Already an edge case since the shard is not supposed to reside on the source node altogether.
In the 2nd case there would be no additional benefit of not reducing replica count and no drawback of increasing it.
I suppose this is again a matter of data safety vs how sure we are the replace succeeds. I am inclined to prefer data safety 🙂, in particular in this case since I think the drawbacks are minor and luck-related.
...java/org/elasticsearch/cluster/routing/allocation/decider/NodeShutdownAllocationDecider.java
Outdated
Show resolved
Hide resolved
...a/org/elasticsearch/cluster/routing/allocation/decider/NodeReplacementAllocationDecider.java
Outdated
Show resolved
Hide resolved
...a/org/elasticsearch/cluster/routing/allocation/decider/NodeReplacementAllocationDecider.java
Show resolved
Hide resolved
e59ba7a
to
c93054c
Compare
...er/src/main/java/org/elasticsearch/cluster/routing/allocation/decider/AllocationDecider.java
Outdated
Show resolved
Hide resolved
...ain/java/org/elasticsearch/cluster/routing/allocation/decider/MaxRetryAllocationDecider.java
Show resolved
Hide resolved
b6505b2
to
08eda8d
Compare
08eda8d
to
92770f0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is looking good! Left a few comments but nothing major.
...ain/java/org/elasticsearch/cluster/routing/allocation/allocator/BalancedShardsAllocator.java
Outdated
Show resolved
Hide resolved
...a/org/elasticsearch/cluster/routing/allocation/decider/NodeReplacementAllocationDecider.java
Outdated
Show resolved
Hide resolved
...a/org/elasticsearch/cluster/routing/allocation/decider/NodeReplacementAllocationDecider.java
Outdated
Show resolved
Hide resolved
...a/org/elasticsearch/cluster/routing/allocation/decider/NodeReplacementAllocationDecider.java
Outdated
Show resolved
Hide resolved
...a/org/elasticsearch/cluster/routing/allocation/decider/NodeReplacementAllocationDecider.java
Outdated
Show resolved
Hide resolved
A separate problem with the current implementation is that we will not allow allocating replicas to a target node where data copies already exists. This could be a problem if the target node falls out of the cluster for a short while and then joins again. This poses following potential issues:
We should look into allowing to allocate the replicas back to the target node to avoid this. There are a few options to consider:
I think option 2 is the best one. It can be argued that by allowing the allocation, we risk more data on the target node. But the amount should be similar to what we would have had if the node had stayed in the cluster and therefore this seems ok. |
c9fe05a
to
2402803
Compare
Pinging @elastic/es-core-infra (Team:Core/Infra) |
@gwbrown @henningandersen okay, I'm slightly concerned that the complexity of this PR is getting a little out of hand (at least it's hard for me to follow since it's been so long-lived). If it's okay with you I think I'd rather lay the base down for node replacement in this PR and then work on the other scenarios in separate work. As far as I know, there are two things that this PR doesn't address:
If you agree that those can be done in subsequent work, could you both take a fresh look at this PR? (I promise I won't force-push any more now that it's not a draft PR) |
@henningandersen thanks for taking another look, I think I addressed all your comments again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for iterating on this with me. I hope @gwbrown can find the time for a review too.
internalCluster().startNode(Settings.builder().put("node.name", nodeB)); | ||
internalCluster().startNode(Settings.builder().put("node.name", nodeC)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can swap the order of starting nodes? Suppose we allowed allocating to both nodes (a bug). But we then risk the shards moving over to B (they only have to start the relocation) before node C have started?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We do have another test that starts nodeC prior to nodeB for the replacement (it was one of the changes you asked for), which I think satisfies this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I did notice that and thanks for doing that. I was only adding this comment here because this test is named testNodeReplacementOnlyToTarget
and it seems wrong to then not have another node that shards could be allocated to. But I agree that the case is covered so you can consider this comment optional for sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks Lee!
// Create an index and pin it to nodeA, when we replace it with nodeB, | ||
// it'll move the data, overridding the `_name` allocation filter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we update this comment, looks like a copy-paste error, since we do not pin the index here?
💔 Backport failed
You can use sqren/backport to manually backport by running |
* WIP, basic implementation * Pull `if` branch into a variable * Remove outdated javadoc * Remove map iteration, use target name instead of id (whoops) * Remove streaming from isReplacementSource * Simplify getReplacementName * Only calculate node shutdowns if canRemain==false and forceMove==false * Move canRebalance comment in BalancedShardsAllocator * Rename canForceDuringVacate -> canForceAllocateDuringReplace * Add comment to AwarenessAllocationDecider.canForceAllocateDuringReplace * Revert changes to ClusterRebalanceAllocationDecider * Change "no replacement" decision message in NodeReplacementAllocationDecider * Only construct shutdown map once in isReplacementSource * Make node shutdowns and target shutdowns available within RoutingAllocation * Add randomization for adding the filter that is overridden in test * Add integration test with replicas: 1 * Go nuts with the verbosity of allocation decisions * Also check NODE_C in unit test * Test with randomly assigned shard * Fix test for extra verbose decision messages * Remove canAllocate(IndexMetadat, RoutingNode, RoutingAllocation) overriding * Spotless :| * Implement 100% disk usage check during force-replace-allocate * Add rudimentary documentation for "replace" shutdown type * Use RoutingAllocation shutdown map in BalancedShardsAllocator * Add canForceAllocateDuringReplace to AllocationDeciders & add test * Switch from percentage to bytes in DiskThresholdDecider force check * Enhance docs with note about rollover, creation, & shrink * Clarify decision messages, add test for target-only allocation * Simplify NodeReplacementAllocationDecider.replacementOngoing * Start nodeC before nodeB in integration test * Spotleeeessssssss! You get me every time! * Remove outdated comment # Conflicts: # x-pack/plugin/shutdown/src/test/java/org/elasticsearch/xpack/shutdown/TransportGetShutdownStatusActionTests.java
* WIP, basic implementation * Pull `if` branch into a variable * Remove outdated javadoc * Remove map iteration, use target name instead of id (whoops) * Remove streaming from isReplacementSource * Simplify getReplacementName * Only calculate node shutdowns if canRemain==false and forceMove==false * Move canRebalance comment in BalancedShardsAllocator * Rename canForceDuringVacate -> canForceAllocateDuringReplace * Add comment to AwarenessAllocationDecider.canForceAllocateDuringReplace * Revert changes to ClusterRebalanceAllocationDecider * Change "no replacement" decision message in NodeReplacementAllocationDecider * Only construct shutdown map once in isReplacementSource * Make node shutdowns and target shutdowns available within RoutingAllocation * Add randomization for adding the filter that is overridden in test * Add integration test with replicas: 1 * Go nuts with the verbosity of allocation decisions * Also check NODE_C in unit test * Test with randomly assigned shard * Fix test for extra verbose decision messages * Remove canAllocate(IndexMetadat, RoutingNode, RoutingAllocation) overriding * Spotless :| * Implement 100% disk usage check during force-replace-allocate * Add rudimentary documentation for "replace" shutdown type * Use RoutingAllocation shutdown map in BalancedShardsAllocator * Add canForceAllocateDuringReplace to AllocationDeciders & add test * Switch from percentage to bytes in DiskThresholdDecider force check * Enhance docs with note about rollover, creation, & shrink * Clarify decision messages, add test for target-only allocation * Simplify NodeReplacementAllocationDecider.replacementOngoing * Start nodeC before nodeB in integration test * Spotleeeessssssss! You get me every time! * Remove outdated comment # Conflicts: # x-pack/plugin/shutdown/src/test/java/org/elasticsearch/xpack/shutdown/TransportGetShutdownStatusActionTests.java
* master: Fix DataTierTests package and add a validation test (elastic#78880) Fix split package org.elasticsearch.common.xcontent (elastic#78831) Store DataTier Preference directly on IndexMetadata (elastic#78668) [DOCS] Fixes typo in calendar API example (elastic#78867) Improve Node Shutdown Observability (elastic#78727) Convert encrypted snapshot license object to LicensedFeature (elastic#78731) Revert "Make nodePaths() singular (elastic#72514)" (elastic#78801) Fix incorrect generic type in PolicyStepsRegistry (elastic#78628) [DOCS] Fixes ML get calendars API (elastic#78808) Implement GET API for System Feature Upgrades (elastic#78642) [TEST] More MetadataStateFormat tests (elastic#78577) Add support for rest compatibility headers to the HLRC (elastic#78490) Un-ignoring tests after backporting fix (elastic#78830) Add node REPLACE shutdown implementation (elastic#76247) Wrap VersionPropertiesLoader in a BuildService to decouple build logic projects (elastic#78704) Adjust /_cat/templates not to request all metadata (elastic#78829) [DOCS] Fixes ML get scheduled events API (elastic#78809) Enable exit on out of memory error (elastic#71542) # Conflicts: # server/src/main/java/org/elasticsearch/cluster/metadata/DataStream.java
This commit enhances `DiskThresholdMonitor` so that indices that have a flood-stage block will not have the block removed while they reside on a node that is part of a "REPLACE"-type node shutdown. This prevents a situation where a node is blocked due to disk usage, then during the replacement the block is removed while shards are relocating to the target node, indexing occurs, and then the target runs out of space due to the additional documents. Relates to elastic#70338 and elastic#76247
#78942) This commit enhances `DiskThresholdMonitor` so that indices that have a flood-stage block will not have the block removed while they reside on a node that is part of a "REPLACE"-type node shutdown. This prevents a situation where a node is blocked due to disk usage, then during the replacement the block is removed while shards are relocating to the target node, indexing occurs, and then the target runs out of space due to the additional documents. Relates to #70338 and #76247
elastic#78942) This commit enhances `DiskThresholdMonitor` so that indices that have a flood-stage block will not have the block removed while they reside on a node that is part of a "REPLACE"-type node shutdown. This prevents a situation where a node is blocked due to disk usage, then during the replacement the block is removed while shards are relocating to the target node, indexing occurs, and then the target runs out of space due to the additional documents. Relates to elastic#70338 and elastic#76247 # Conflicts: # server/src/test/java/org/elasticsearch/cluster/routing/allocation/DiskThresholdMonitorTests.java
#78942) (#79008) This commit enhances `DiskThresholdMonitor` so that indices that have a flood-stage block will not have the block removed while they reside on a node that is part of a "REPLACE"-type node shutdown. This prevents a situation where a node is blocked due to disk usage, then during the replacement the block is removed while shards are relocating to the target node, indexing occurs, and then the target runs out of space due to the additional documents. Relates to #70338 and #76247 # Conflicts: # server/src/test/java/org/elasticsearch/cluster/routing/allocation/DiskThresholdMonitorTests.java
This commit allows replica shards that have existing data on disk to be re-allocated to the target of a "REPLACE" type node shutdown. Prior to this if the target node of a shutdown were to restart, the replicas would not be allowed to be allocated even if their data existed on disk. Relates to elastic#70338 as a follow-up to elastic#76247
…ement (#79171) This commit allows replica shards that have existing data on disk to be re-allocated to the target of a "REPLACE" type node shutdown. Prior to this if the target node of a shutdown were to restart, the replicas would not be allowed to be allocated even if their data existed on disk. Relates to #70338 as a follow-up to #76247
…ement (elastic#79171) This commit allows replica shards that have existing data on disk to be re-allocated to the target of a "REPLACE" type node shutdown. Prior to this if the target node of a shutdown were to restart, the replicas would not be allowed to be allocated even if their data existed on disk. Relates to elastic#70338 as a follow-up to elastic#76247
…ement (#79171) (#79266) This commit allows replica shards that have existing data on disk to be re-allocated to the target of a "REPLACE" type node shutdown. Prior to this if the target node of a shutdown were to restart, the replicas would not be allowed to be allocated even if their data existed on disk. Relates to #70338 as a follow-up to #76247
This adds the implementation for the
REPLACE
shutdown type. Currently a shutdown requires a target node and moves the shutdown node's data to the target node, ignoring certain allocation deciders.Relates to #70338